Tutorial for gwaslab¶
- In this tutorial, we will perform a common pipeline for sumstats QC, standardization and harmonization.
- We will also show examples of visualization.
- Using a jupyter notebook, we first download sample data.
Download sample data¶
The data we will use as an example: sumstats of type 2 diabetes from BBJ
!wget -O t2d_bbj.txt.gz http://jenger.riken.jp/14/
--2022-10-21 15:41:14-- http://jenger.riken.jp/14/ Resolving jenger.riken.jp (jenger.riken.jp)... 134.160.84.25 Connecting to jenger.riken.jp (jenger.riken.jp)|134.160.84.25|:80... connected. HTTP request sent, awaiting response... 200 OK Length: 274187574 (261M) [text/plain] Saving to: ‘t2d_bbj.txt.gz’ t2d_bbj.txt.gz 100%[===================>] 261.49M 38.0MB/s in 7.3s 2022-10-21 15:41:22 (36.0 MB/s) - ‘t2d_bbj.txt.gz’ saved [274187574/274187574]
Import gwaslab package¶
If you download it from pip, simply run:
import gwaslab as gl
Or if you want to use the latest version from github (beta version) you can clone the repo and import like:
import sys
sys.path.insert(0,"/Users/he/work/gwaslab/src")
import gwaslab as gl
Loading data into gwaslab Sumstats¶
Let's import this raw sumstats into the gwaslab Sumstats Object by specifying the necessary columns.
Note: you can either specify eaf (effect allele frequency) or neaf (non-effect allele frequency), if neaf is specified, it will be converted to eaf when loading sumstats.
mysumstats = gl.Sumstats("t2d_bbj.txt.gz",
snpid="SNP",
chrom="CHR",
pos="POS",
ea="ALT",
nea="REF",
neaf="Frq",
beta="BETA",
se="SE",
p="P",
direction="Dir",
n="N")
Tue Dec 13 17:08:19 2022 Start to initiate from file :t2d_bbj.txt.gz Tue Dec 13 17:08:51 2022 -Reading columns : ALT,SE,N,Dir,SNP,Frq,BETA,CHR,REF,P,POS Tue Dec 13 17:08:51 2022 -Renaming columns to : EA,SE,N,DIRECTION,SNPID,EAF,BETA,CHR,NEA,P,POS Tue Dec 13 17:08:51 2022 -Current Dataframe shape : 12557761 x 11 Tue Dec 13 17:08:58 2022 -Initiating a status column: STATUS ... Tue Dec 13 17:09:03 2022 -NEAF is specified... Tue Dec 13 17:09:03 2022 -Checking if 0<= NEAF <=1 ... Tue Dec 13 17:09:07 2022 -Converted NEAF to EAF. Tue Dec 13 17:09:07 2022 -Removed 0 variants with bad NEAF. Tue Dec 13 17:09:07 2022 Start to reorder the columns... Tue Dec 13 17:09:07 2022 -Current Dataframe shape : 12557761 x 12 Tue Dec 13 17:09:07 2022 -Reordering columns to : SNPID,CHR,POS,EA,NEA,EAF,BETA,SE,P,N,DIRECTION,STATUS Tue Dec 13 17:09:08 2022 Finished sorting columns successfully! Tue Dec 13 17:09:09 2022 Finished loading data successfully!
Sumstats are stored in Sumstats.data as apandas.DataFrame, you can check the data by:
mysumstats.data
| SNPID | CHR | POS | EA | NEA | EAF | BETA | SE | P | N | DIRECTION | STATUS | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1:725932_G_A | 1 | 725932 | G | A | 0.9960 | -0.0737 | 0.1394 | 0.5970 | 166718 | -?+- | 9999999 |
| 1 | 1:725933_A_G | 1 | 725933 | G | A | 0.0040 | 0.0737 | 0.1394 | 0.5973 | 166718 | +?-+ | 9999999 |
| 2 | 1:737801_T_C | 1 | 737801 | C | T | 0.0051 | 0.0490 | 0.1231 | 0.6908 | 166718 | +?-+ | 9999999 |
| 3 | 1:749963_T_TAA | 1 | 749963 | TAA | T | 0.8374 | 0.0213 | 0.0199 | 0.2846 | 166718 | -?++ | 9999999 |
| 4 | 1:751343_T_A | 1 | 751343 | T | A | 0.8593 | 0.0172 | 0.0156 | 0.2705 | 166718 | -?++ | 9999999 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 12557756 | X:154874837_A_G | X | 154874837 | G | A | 0.7478 | -0.0064 | 0.0117 | 0.5840 | 191764 | -+-+ | 9999999 |
| 12557757 | X:154875192_GTACTC_G | X | 154875192 | GTACTC | G | 0.2525 | 0.0071 | 0.0122 | 0.5612 | 191764 | +-+- | 9999999 |
| 12557758 | X:154879115_A_G | X | 154879115 | G | A | 0.7463 | -0.0070 | 0.0122 | 0.5646 | 191764 | -+-+ | 9999999 |
| 12557759 | X:154880669_T_A | X | 154880669 | T | A | 0.2558 | 0.0071 | 0.0122 | 0.5618 | 191764 | +-+- | 9999999 |
| 12557760 | X:154880917_C_T | X | 154880917 | C | T | 0.2558 | 0.0072 | 0.0122 | 0.5570 | 191764 | +-+- | 9999999 |
12557761 rows × 12 columns
For details on gwaslab Sumstats Object, see: https://cloufield.github.io/gwaslab/SumstatsObject/
Create Manhattan and Q-Q plot¶
Maybe the first thing you want to check is the manhattan and qq plots. gwaslab will do a minimum QC for sumstats when plotting.
For details on Manhattan and Q-Q plot, see: https://cloufield.github.io/gwaslab/Visualization/
# skip : skip variants with MLOG10P<2 for faster plotting speed
mysumstats.plot_mqq(skip=2)
Tue Dec 13 17:09:10 2022 Start to plot manhattan/qq plot with the following basic settings: Tue Dec 13 17:09:10 2022 -Genome-wide significance level is set to 5e-08 ... Tue Dec 13 17:09:10 2022 -Raw input contains 12557761 variants... Tue Dec 13 17:09:10 2022 -Plot layout mode is : mqq Tue Dec 13 17:09:13 2022 Finished loading specified columns from the sumstats. Tue Dec 13 17:09:13 2022 Start conversion and sanity check: Tue Dec 13 17:09:14 2022 -Removed 0 variants with nan in CHR or POS column ... Tue Dec 13 17:09:14 2022 -Removed 0 variants with nan in P column ... Tue Dec 13 17:09:14 2022 -P values are being converted to -log10(P)... Tue Dec 13 17:09:14 2022 -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed... Tue Dec 13 17:09:17 2022 -Sanity check: 0 na/inf/-inf variants will be removed... Tue Dec 13 17:09:19 2022 -Maximum -log10(P) values is 167.58838029403677 . Tue Dec 13 17:09:19 2022 Finished data conversion and sanity check. Tue Dec 13 17:09:20 2022 Start to create manhattan plot with 332882 variants: Tue Dec 13 17:09:22 2022 -Found 89 significant variants with a sliding window size of 500 kb... Tue Dec 13 17:09:22 2022 Finished creating Manhattan plot successfully! Tue Dec 13 17:09:22 2022 -Skip annotating Tue Dec 13 17:09:22 2022 Start to create QQ plot with 332882 variants: Tue Dec 13 17:09:23 2022 -Calculating lambda GC: 1.2128258399293808 Tue Dec 13 17:09:23 2022 Finished creating QQ plot successfully!
(<Figure size 3000x1000 with 2 Axes>, <gwaslab.Log.Log at 0x7fa69cfbd580>)
Standardization & QC : basic_check()¶
It is needed to check SNP ID, rsID, chromosome, basepair position, alleles and statistics first before any manipulations (sometimes you need do this before plotting if the sumstats is in a really messy format):
#check SNPID,rsID,CHR,POS,EA, NEA and statistics
mysumstats.basic_check()
Tue Dec 13 17:09:29 2022 Start to check IDs... Tue Dec 13 17:09:29 2022 -Current Dataframe shape : 12557761 x 12 Tue Dec 13 17:09:29 2022 -Checking if SNPID is chr:pos:ref:alt...(separator: - ,: , _) Tue Dec 13 17:09:50 2022 Finished checking IDs successfully! Tue Dec 13 17:09:50 2022 Start to fix chromosome notation... Tue Dec 13 17:09:50 2022 -Current Dataframe shape : 12557761 x 12 Tue Dec 13 17:10:30 2022 -Vairants with standardized chromosome notation: 12228970 Tue Dec 13 17:11:09 2022 -Vairants with fixable chromosome notations: 328791 Tue Dec 13 17:11:09 2022 -No unrecognized chromosome notations... Tue Dec 13 17:11:13 2022 -Identified 328791 variants on sex chromosomes... Tue Dec 13 17:11:13 2022 -Standardizing sex chromosome notations: X Y MT to 23,24,25... Tue Dec 13 17:11:44 2022 Finished fixing chromosome notation successfully! Tue Dec 13 17:11:44 2022 Start to fix basepair positions... Tue Dec 13 17:11:44 2022 -Current Dataframe shape : 12557761 x 12 Tue Dec 13 17:11:44 2022 -Converting to Int64 data type ... Tue Dec 13 17:12:10 2022 -Position upper_bound is: 250,000,000 Tue Dec 13 17:12:49 2022 -Remove outliers: 0 Tue Dec 13 17:13:03 2022 -Converted all position to datatype Int64. Tue Dec 13 17:13:03 2022 Finished fixing basepair position successfully! Tue Dec 13 17:13:03 2022 Start to fix alleles... Tue Dec 13 17:13:03 2022 -Current Dataframe shape : 12557761 x 12 Tue Dec 13 17:13:10 2022 -Detected 0 variants with alleles that contain bases other than A/C/T/G . Tue Dec 13 17:13:13 2022 -Converted all bases to string datatype and UPPERCASE. Tue Dec 13 17:13:56 2022 Finished fixing allele successfully! Tue Dec 13 17:13:56 2022 Start sanity check for statistics ... Tue Dec 13 17:13:56 2022 -Current Dataframe shape : 12557761 x 12 Tue Dec 13 17:14:06 2022 -Checking if 0 <=N<= inf ... Tue Dec 13 17:14:17 2022 -Removed 0 variants with bad N. Tue Dec 13 17:14:17 2022 -Checking if 0 <=EAF<= 1 ... Tue Dec 13 17:14:22 2022 -Removed 0 variants with bad EAF. Tue Dec 13 17:14:22 2022 -Checking if 5 <=MAC<= inf ... Tue Dec 13 17:14:28 2022 -Removed 0 variants with bad MAC. Tue Dec 13 17:14:28 2022 -Checking if 5e-300 <= P <= 1 ... Tue Dec 13 17:14:31 2022 -Removed 0 variants with bad P. Tue Dec 13 17:14:31 2022 -Checking if -10 <BETA)< 10 ... Tue Dec 13 17:14:36 2022 -Removed 0 variants with bad BETA. Tue Dec 13 17:14:36 2022 -Checking if 0 <SE< inf ... Tue Dec 13 17:14:40 2022 -Removed 0 variants with bad SE. Tue Dec 13 17:14:40 2022 -Checking STATUS... Tue Dec 13 17:14:43 2022 -Coverting STAUTUS to interger. Tue Dec 13 17:14:46 2022 -Removed 0 variants with bad statistics in total. Tue Dec 13 17:14:46 2022 Finished sanity check successfully! Tue Dec 13 17:14:48 2022 Start to normalize variants... Tue Dec 13 17:14:48 2022 -Current Dataframe shape : 12557761 x 12 Tue Dec 13 17:15:08 2022 -Not normalized allele IDs:X:7151130_TT_TC X:9093382_CTTTT_CTTT X:12292253_ATTT_ATT X:16001576_ATT_ATTT X:33822416_GT_GTT ... Tue Dec 13 17:15:08 2022 -Not normalized allele:['TT' 'TC']['CTTTT' 'CTTT']['ATTT' 'ATT']['ATTT' 'ATT']['GTT' 'GT']... Tue Dec 13 17:15:08 2022 -Modified 13 variants according to parsimony and left alignment principal. Tue Dec 13 17:15:09 2022 Finished normalizing variants successfully!
mysumstats.data
| SNPID | CHR | POS | EA | NEA | EAF | BETA | SE | P | N | DIRECTION | STATUS | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1:725932_G_A | 1 | 725932 | G | A | 0.9960 | -0.0737 | 0.1394 | 0.5970 | 166718 | -?+- | 9960099 |
| 1 | 1:725933_A_G | 1 | 725933 | G | A | 0.0040 | 0.0737 | 0.1394 | 0.5973 | 166718 | +?-+ | 9960099 |
| 2 | 1:737801_T_C | 1 | 737801 | C | T | 0.0051 | 0.0490 | 0.1231 | 0.6908 | 166718 | +?-+ | 9960099 |
| 3 | 1:749963_T_TAA | 1 | 749963 | TAA | T | 0.8374 | 0.0213 | 0.0199 | 0.2846 | 166718 | -?++ | 9960399 |
| 4 | 1:751343_T_A | 1 | 751343 | T | A | 0.8593 | 0.0172 | 0.0156 | 0.2705 | 166718 | -?++ | 9960099 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 12557756 | X:154874837_A_G | 23 | 154874837 | G | A | 0.7478 | -0.0064 | 0.0117 | 0.5840 | 191764 | -+-+ | 9960099 |
| 12557757 | X:154875192_GTACTC_G | 23 | 154875192 | GTACTC | G | 0.2525 | 0.0071 | 0.0122 | 0.5612 | 191764 | +-+- | 9960399 |
| 12557758 | X:154879115_A_G | 23 | 154879115 | G | A | 0.7463 | -0.0070 | 0.0122 | 0.5646 | 191764 | -+-+ | 9960099 |
| 12557759 | X:154880669_T_A | 23 | 154880669 | T | A | 0.2558 | 0.0071 | 0.0122 | 0.5618 | 191764 | +-+- | 9960099 |
| 12557760 | X:154880917_C_T | 23 | 154880917 | C | T | 0.2558 | 0.0072 | 0.0122 | 0.5570 | 191764 | +-+- | 9960099 |
12557761 rows × 12 columns
By checking the log, the sumstats look good. But we found several variants that were not normalizaed and gwaslab fixed this.
.basic_check() is a wrapper of all the following basic functions, you can use these separately.
- mysumstats.fix_ID()
- mysumstats.fix_chr()
- mysumstats.fix_pos()
- mysumstats.fix_allele()
- mysumstats.check_sanity()
- mysumstats.normalize_allele()
For other options, see: https://cloufield.github.io/gwaslab/Standardization/
Extract lead variants : get_lead()¶
Let's extract the lead variants in each significant loci to check our data.
The significant loci are detected based on a sliding window (default window size: windowsizekb=500 kb)
By specifying anno=True , gwaslab will also annotate the lead variant with its nearest gene names and distance.
Note: gwaslab default genome build version is build="19" (GRCh37/hg19), you can change it to build="38" (GRCh38/hg38) when needed.
mysumstats.get_lead(anno=True)
Tue Dec 13 17:15:10 2022 Start to extract lead variants... Tue Dec 13 17:15:10 2022 -Processing 12557761 variants... Tue Dec 13 17:15:10 2022 -Significance threshold : 5e-08 Tue Dec 13 17:15:10 2022 -Sliding window size: 500 kb Tue Dec 13 17:15:16 2022 -Found 9461 significant variants in total... Tue Dec 13 17:15:17 2022 -Identified 89 lead variants! Tue Dec 13 17:15:17 2022 Start to annotate variants with nearest gene name(s)... Tue Dec 13 17:15:17 2022 -Assigning Gene name using Ensembl Release hg19 Tue Dec 13 17:15:22 2022 Finished annotating variants with nearest gene name(s) successfully! Tue Dec 13 17:15:22 2022 Finished extracting lead variants successfully!
| SNPID | CHR | POS | EA | NEA | EAF | BETA | SE | P | N | DIRECTION | STATUS | LOCATION | GENE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 96739 | 1:22068326_A_G | 1 | 22068326 | G | A | 0.7550 | 0.0621 | 0.0103 | 1.629000e-09 | 191764 | ++++ | 9960099 | 0 | USP48 |
| 213860 | 1:51103268_T_C | 1 | 51103268 | C | T | 0.7953 | -0.0802 | 0.0120 | 2.519000e-11 | 191764 | ---- | 9960099 | 0 | FAF1 |
| 534095 | 1:154309595_TA_T | 1 | 154309595 | TA | T | 0.0947 | -0.0915 | 0.0166 | 3.289000e-08 | 191764 | ---- | 9960399 | 0 | ATP8B2 |
| 969974 | 2:640986_CACAT_C | 2 | 640986 | C | CACAT | 0.9006 | -0.0946 | 0.0150 | 2.665000e-10 | 191764 | ---- | 9960399 | 26349 | TMEM18 |
| 1091807 | 2:27734972_G_A | 2 | 27734972 | G | A | 0.5605 | 0.0691 | 0.0088 | 3.897000e-15 | 191764 | ++++ | 9960099 | 0 | GCKR |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 12272930 | X:21569920_A_G | 23 | 21569920 | G | A | 0.3190 | 0.0423 | 0.0076 | 2.616000e-08 | 191764 | ++++ | 9960099 | 0 | CNKSR2 |
| 12341406 | X:48724648_CAA_C | 23 | 48724648 | C | CAA | 0.6260 | -0.0602 | 0.0103 | 4.576000e-09 | 191764 | ---- | 9960399 | 26082 | TIMM17B |
| 12350767 | X:57170781_A_AT | 23 | 57170781 | AT | A | 0.3003 | -0.0447 | 0.0076 | 4.583000e-09 | 191764 | ---- | 9960399 | -6723 | SPIN2A |
| 12469290 | X:117915163_T_TA | 23 | 117915163 | TA | T | 0.5560 | 0.0548 | 0.0071 | 9.818000e-15 | 191764 | ++++ | 9960399 | 0 | IL13RA1 |
| 12554976 | X:152908887_G_A | 23 | 152908887 | G | A | 0.6792 | -0.1235 | 0.0077 | 9.197000e-58 | 191764 | ---- | 9960099 | 0 | DUSP9 |
89 rows × 14 columns
For other options, see: https://cloufield.github.io/gwaslab/ExtractLead/
Use the SNPID to create some highly customized mqq plot¶
Gwaslab can create more complicated manhattan plot.
plot and save as my_first_mqq_plot.png with {"dpi":400,"facecolor":"white"}
mysumstats.plot_mqq(mode="mqq",
cut=20,skip=3,
anno="GENENAME",
anno_set=["9:22132729_A_G","6:20688121_T_A","9:22132729_A_G","15:62394264_G_C"] ,
pinpoint=["9:22132729_A_G","5:176513896_C_A"],
highlight=["7:127253550_C_T","19:46166604_C_T"],
highlight_windowkb =1000,
stratified=True,
marker_size=(5,10),
figargs={"figsize":(15,5),"dpi":300},
save="my_first_mqq_plot.png",
saveargs={"dpi":400,"facecolor":"white"})
Tue Dec 13 17:15:23 2022 Start to plot manhattan/qq plot with the following basic settings: Tue Dec 13 17:15:23 2022 -Genome-wide significance level is set to 5e-08 ... Tue Dec 13 17:15:23 2022 -Raw input contains 12557761 variants... Tue Dec 13 17:15:23 2022 -Plot layout mode is : mqq Tue Dec 13 17:15:23 2022 -Variants to annotate : 9:22132729_A_G,6:20688121_T_A,9:22132729_A_G,15:62394264_G_C Tue Dec 13 17:15:23 2022 -Loci to highlight : 7:127253550_C_T,19:46166604_C_T Tue Dec 13 17:15:23 2022 -Highlight_window is set to: 1000 kb Tue Dec 13 17:15:23 2022 -Variants to pinpoint : 9:22132729_A_G,5:176513896_C_A Tue Dec 13 17:15:31 2022 Finished loading specified columns from the sumstats. Tue Dec 13 17:15:31 2022 Start conversion and sanity check: Tue Dec 13 17:15:34 2022 -Removed 0 variants with nan in CHR or POS column ... Tue Dec 13 17:15:36 2022 -Removed 0 variants with nan in EAF column ... Tue Dec 13 17:15:38 2022 -Removed 0 variants with nan in P column ... Tue Dec 13 17:15:38 2022 -P values are being converted to -log10(P)... Tue Dec 13 17:15:38 2022 -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed... Tue Dec 13 17:15:41 2022 -Sanity check: 0 na/inf/-inf variants will be removed... Tue Dec 13 17:15:45 2022 -Maximum -log10(P) values is 167.58838029403677 . Tue Dec 13 17:15:45 2022 -Minus log10(P) values above 20 will be shrunk with a shrinkage factor of 10... Tue Dec 13 17:15:45 2022 Finished data conversion and sanity check. Tue Dec 13 17:15:45 2022 Start to create manhattan plot with 91234 variants: Tue Dec 13 17:15:45 2022 -Highlighting target loci... Tue Dec 13 17:15:45 2022 -Pinpointing target vairants... Tue Dec 13 17:15:45 2022 -Found 3 specified variants to annotate... Tue Dec 13 17:15:45 2022 Start to annotate variants with nearest gene name(s)... Tue Dec 13 17:15:45 2022 -Assigning Gene name using Ensembl Release hg19 Tue Dec 13 17:15:46 2022 Finished annotating variants with nearest gene name(s) successfully! Tue Dec 13 17:15:46 2022 Finished creating Manhattan plot successfully! Tue Dec 13 17:15:46 2022 -Annotating using column GENENAME... Tue Dec 13 17:15:46 2022 Start to create QQ plot with 91234 variants: Tue Dec 13 17:15:48 2022 -Calculating lambda GC: 1.2128258399293808 Tue Dec 13 17:15:48 2022 Finished creating QQ plot successfully! Tue Dec 13 17:15:48 2022 Saving plot: Tue Dec 13 17:15:51 2022 -Saved to my_first_mqq_plot.png successfully!
(<Figure size 4500x1500 with 2 Axes>, <gwaslab.Log.Log at 0x7fa69cfbd580>)
For details, see: https://cloufield.github.io/gwaslab/Visualization/
Quick regional plot without LD-information¶
Gwaslab can plot regional plot with or with out LD reference files.
For details see: https://cloufield.github.io/gwaslab/RegionalPlot/
mysumstats.plot_mqq(mode="r",region=(7,156538803,157538803),region_grid=True, figargs= {"figsize":(15,5)})
Tue Dec 13 17:15:55 2022 Start to plot manhattan/qq plot with the following basic settings: Tue Dec 13 17:15:55 2022 -Genome-wide significance level is set to 5e-08 ... Tue Dec 13 17:15:55 2022 -Raw input contains 12557761 variants... Tue Dec 13 17:15:55 2022 -Plot layout mode is : r Tue Dec 13 17:15:55 2022 -Region to plot : chr7:156538803-157538803. Tue Dec 13 17:15:58 2022 -Extract SNPs in region : chr7:156538803-157538803... Tue Dec 13 17:16:38 2022 -Extract SNPs in specified regions: 5831 Tue Dec 13 17:16:39 2022 Finished loading specified columns from the sumstats. Tue Dec 13 17:16:39 2022 Start conversion and sanity check: Tue Dec 13 17:16:39 2022 -Removed 0 variants with nan in CHR or POS column ... Tue Dec 13 17:16:39 2022 -Removed 0 variants with nan in P column ... Tue Dec 13 17:16:39 2022 -P values are being converted to -log10(P)... Tue Dec 13 17:16:39 2022 -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed... Tue Dec 13 17:16:39 2022 -Sanity check: 0 na/inf/-inf variants will be removed... Tue Dec 13 17:16:39 2022 -Maximum -log10(P) values is 7.948076083953893 . Tue Dec 13 17:16:39 2022 Finished data conversion and sanity check. Tue Dec 13 17:16:39 2022 Start to create manhattan plot with 5831 variants: Tue Dec 13 17:16:40 2022 -Loading gtf files from:default Tue Dec 13 17:17:00 2022 -plotting gene track.. Tue Dec 13 17:17:00 2022 -Finished plotting gene track.. Tue Dec 13 17:17:01 2022 -Found 1 significant variants with a sliding window size of 500 kb... Tue Dec 13 17:17:01 2022 Finished creating Manhattan plot successfully! Tue Dec 13 17:17:01 2022 -Skip annotating
(<Figure size 3000x2000 with 3 Axes>, <gwaslab.Log.Log at 0x7fa69cfbd580>)
reference download¶
Full regional plot using a user-provided vcf or preprocessed vcf: (e.g 1000 genome, see Reference: https://cloufield.github.io/gwaslab/Reference/)
check available reference¶
update the available reference list first
gl.update_available_ref()
Tue Dec 13 17:17:03 2022 Updating available_ref list from: https://raw.github.com/Cloufield/gwaslab/main/src/gwaslab/data/reference.json Tue Dec 13 17:17:19 2022 Available_ref list has been updated!
gl.check_available_ref()
Tue Dec 13 17:17:19 2022 Start to check available reference files... Tue Dec 13 17:17:19 2022 - 1kg_eas_hg19 : https://www.dropbox.com/s/lztaxqhy2o6dpxw/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?dl=1 Tue Dec 13 17:17:19 2022 - 1kg_eas_hg19_tbi : https://www.dropbox.com/s/k9klefl8m9fcfo1/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?dl=1 Tue Dec 13 17:17:19 2022 - 1kg_eur_hg19 : https://www.dropbox.com/s/1nbgqshknevseks/EUR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?dl=1 Tue Dec 13 17:17:19 2022 - 1kg_eur_hg19_tbi : https://www.dropbox.com/s/vscvkrflh6fc5a0/EUR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?dl=1 Tue Dec 13 17:17:19 2022 - 1kg_eas_hg38 : https://www.dropbox.com/s/3dstbbb1el9r3au/EAS.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?dl=1 Tue Dec 13 17:17:19 2022 - 1kg_eas_hg38_tbi : https://www.dropbox.com/s/vwnp5vd8dcqksn4/EAS.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?dl=1 Tue Dec 13 17:17:19 2022 - 1kg_eur_hg38 : https://www.dropbox.com/s/z0mkehg17lryapv/EUR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?dl=1 Tue Dec 13 17:17:19 2022 - 1kg_eur_hg38_tbi : https://www.dropbox.com/s/ze8g58x75x9qbf0/EUR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?dl=1 Tue Dec 13 17:17:19 2022 - dbsnp_v151_hg19 : https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/00-All.vcf.gz Tue Dec 13 17:17:19 2022 - dbsnp_v151_hg38 : https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/00-All.vcf.gz Tue Dec 13 17:17:19 2022 - ucsc_genome_hg19 : http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/hg19.fa.gz Tue Dec 13 17:17:19 2022 - ucsc_genome_hg38 : https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz Tue Dec 13 17:17:19 2022 - 1kg_dbsnp151_hg19_auto : https://www.dropbox.com/s/37p2u1xwmux4gwo/1kg_dbsnp151_hg19_auto.txt.gz?dl=1 Tue Dec 13 17:17:19 2022 - 1kg_dbsnp151_hg38_auto : https://www.dropbox.com/s/ouf60n7gdz6cm0g/1kg_dbsnp151_hg38_auto.txt.gz?dl=1 Tue Dec 13 17:17:19 2022 - recombination_hg19 : https://www.dropbox.com/s/4b39o8m58m5yvqx/recombination_hg19.tar.gz?dl=1 Tue Dec 13 17:17:19 2022 - ensembl_hg19_gtf : https://ftp.ensembl.org/pub/grch37/current/gtf/homo_sapiens/Homo_sapiens.GRCh37.87.chr.gtf.gz Tue Dec 13 17:17:19 2022 - ensembl_hg19_gtf_protein_coding : https://www.dropbox.com/s/ovsqf4kyeniakxf/Homo_sapiens.GRCh37.75.protein_coding.gtf.gz?dl=1 Tue Dec 13 17:17:19 2022 - ensembl_hg38_gtf : https://ftp.ensembl.org/pub/current_gtf/homo_sapiens/Homo_sapiens.GRCh38.108.chr.gtf.gz Tue Dec 13 17:17:19 2022 - ensembl_hg38_gtf_protein_coding : https://www.dropbox.com/s/37kicx0l6uifa1e/Homo_sapiens.GRCh38.107.protein_coding.chr.gtf.gz?dl=1 Tue Dec 13 17:17:19 2022 - refseq_hg19_gtf : https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/GRCh37_latest_genomic.gtf.gz Tue Dec 13 17:17:19 2022 - refseq_hg19_gtf_protein_coding : https://www.dropbox.com/s/gpzflnid7zo3kca/GRCh37_latest_genomic.protein_coding.gtf.gz?dl=1 Tue Dec 13 17:17:19 2022 - refseq_hg38_gtf : https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.gff.gz Tue Dec 13 17:17:19 2022 - refseq_hg38_gtf_protein_coding : https://www.dropbox.com/s/reojrsjmbwsu0pp/GRCh38_latest_genomic.protein_coding.gtf.gz?dl=1 Tue Dec 13 17:17:19 2022 - testlink : https://www.dropbox.com/s/8u7capwge0ihshu/EAS.chr22.split_norm_af.1kgp3v5.vcf.gz?dl=1 Tue Dec 13 17:17:19 2022 - testlink_tbi : https://www.dropbox.com/s/hdneg53t6u1j6ib/EAS.chr22.split_norm_af.1kgp3v5.vcf.gz.tbi?dl=1
{'1kg_eas_hg19': 'https://www.dropbox.com/s/lztaxqhy2o6dpxw/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?dl=1',
'1kg_eas_hg19_tbi': 'https://www.dropbox.com/s/k9klefl8m9fcfo1/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?dl=1',
'1kg_eur_hg19': 'https://www.dropbox.com/s/1nbgqshknevseks/EUR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?dl=1',
'1kg_eur_hg19_tbi': 'https://www.dropbox.com/s/vscvkrflh6fc5a0/EUR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?dl=1',
'1kg_eas_hg38': 'https://www.dropbox.com/s/3dstbbb1el9r3au/EAS.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?dl=1',
'1kg_eas_hg38_tbi': 'https://www.dropbox.com/s/vwnp5vd8dcqksn4/EAS.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?dl=1',
'1kg_eur_hg38': 'https://www.dropbox.com/s/z0mkehg17lryapv/EUR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?dl=1',
'1kg_eur_hg38_tbi': 'https://www.dropbox.com/s/ze8g58x75x9qbf0/EUR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?dl=1',
'dbsnp_v151_hg19': 'https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/00-All.vcf.gz',
'dbsnp_v151_hg38': 'https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/00-All.vcf.gz',
'ucsc_genome_hg19': 'http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/hg19.fa.gz',
'ucsc_genome_hg38': 'https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz',
'1kg_dbsnp151_hg19_auto': 'https://www.dropbox.com/s/37p2u1xwmux4gwo/1kg_dbsnp151_hg19_auto.txt.gz?dl=1',
'1kg_dbsnp151_hg38_auto': 'https://www.dropbox.com/s/ouf60n7gdz6cm0g/1kg_dbsnp151_hg38_auto.txt.gz?dl=1',
'recombination_hg19': 'https://www.dropbox.com/s/4b39o8m58m5yvqx/recombination_hg19.tar.gz?dl=1',
'ensembl_hg19_gtf': 'https://ftp.ensembl.org/pub/grch37/current/gtf/homo_sapiens/Homo_sapiens.GRCh37.87.chr.gtf.gz',
'ensembl_hg19_gtf_protein_coding': 'https://www.dropbox.com/s/ovsqf4kyeniakxf/Homo_sapiens.GRCh37.75.protein_coding.gtf.gz?dl=1',
'ensembl_hg38_gtf': 'https://ftp.ensembl.org/pub/current_gtf/homo_sapiens/Homo_sapiens.GRCh38.108.chr.gtf.gz',
'ensembl_hg38_gtf_protein_coding': 'https://www.dropbox.com/s/37kicx0l6uifa1e/Homo_sapiens.GRCh38.107.protein_coding.chr.gtf.gz?dl=1',
'refseq_hg19_gtf': 'https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/GRCh37_latest_genomic.gtf.gz',
'refseq_hg19_gtf_protein_coding': 'https://www.dropbox.com/s/gpzflnid7zo3kca/GRCh37_latest_genomic.protein_coding.gtf.gz?dl=1',
'refseq_hg38_gtf': 'https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.gff.gz',
'refseq_hg38_gtf_protein_coding': 'https://www.dropbox.com/s/reojrsjmbwsu0pp/GRCh38_latest_genomic.protein_coding.gtf.gz?dl=1',
'testlink': 'https://www.dropbox.com/s/8u7capwge0ihshu/EAS.chr22.split_norm_af.1kgp3v5.vcf.gz?dl=1',
'testlink_tbi': 'https://www.dropbox.com/s/hdneg53t6u1j6ib/EAS.chr22.split_norm_af.1kgp3v5.vcf.gz.tbi?dl=1'}
download reference using gwaslab¶
gl.download_ref("1kg_eas_hg19")
Tue Dec 13 17:17:19 2022 Start to download 1kg_eas_hg19 ... Tue Dec 13 17:17:19 2022 -Downloading to: /Users/he/opt/anaconda3/lib/python3.8/site-packages/gwaslab/data/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz Tue Dec 13 17:24:21 2022 -Updating record in config file... Tue Dec 13 17:24:23 2022 -Updating record in config file... Tue Dec 13 17:24:23 2022 -Downloading to: /Users/he/opt/anaconda3/lib/python3.8/site-packages/gwaslab/data/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi Tue Dec 13 17:24:23 2022 Downloaded 1kg_eas_hg19 successfully!
after downloading, use get_path to select the file path
mysumstats.plot_mqq(mode="r",region=(7,156538803,157538803),region_grid=True,vcf_path=gl.get_path("1kg_eas_hg19"))
Tue Dec 13 17:24:23 2022 Start to plot manhattan/qq plot with the following basic settings: Tue Dec 13 17:24:23 2022 -Genome-wide significance level is set to 5e-08 ... Tue Dec 13 17:24:23 2022 -Raw input contains 12557761 variants... Tue Dec 13 17:24:23 2022 -Plot layout mode is : r Tue Dec 13 17:24:23 2022 -Region to plot : chr7:156538803-157538803. Tue Dec 13 17:24:27 2022 -Extract SNPs in region : chr7:156538803-157538803... Tue Dec 13 17:24:58 2022 -Extract SNPs in specified regions: 5831 Tue Dec 13 17:24:59 2022 Finished loading specified columns from the sumstats. Tue Dec 13 17:24:59 2022 Start conversion and sanity check: Tue Dec 13 17:24:59 2022 -Removed 0 variants with nan in CHR or POS column ... Tue Dec 13 17:24:59 2022 -Removed 0 variants with nan in P column ... Tue Dec 13 17:24:59 2022 -P values are being converted to -log10(P)... Tue Dec 13 17:24:59 2022 -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed... Tue Dec 13 17:24:59 2022 -Sanity check: 0 na/inf/-inf variants will be removed... Tue Dec 13 17:24:59 2022 -Maximum -log10(P) values is 7.948076083953893 . Tue Dec 13 17:24:59 2022 Finished data conversion and sanity check. Tue Dec 13 17:24:59 2022 Start to load reference genotype... Tue Dec 13 17:24:59 2022 -reference vcf path : /Users/he/opt/anaconda3/lib/python3.8/site-packages/gwaslab/data/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz Tue Dec 13 17:25:01 2022 -Retrieving index... Tue Dec 13 17:25:01 2022 -Ref variants in the region: 35278 Tue Dec 13 17:25:01 2022 -Calculating Rsq... Tue Dec 13 17:25:01 2022 Finished loading reference genotype successfully! Tue Dec 13 17:25:01 2022 Start to create manhattan plot with 5831 variants: Tue Dec 13 17:25:01 2022 -Loading gtf files from:default Tue Dec 13 17:25:18 2022 -plotting gene track.. Tue Dec 13 17:25:19 2022 -Finished plotting gene track.. Tue Dec 13 17:25:19 2022 -Found 1 significant variants with a sliding window size of 500 kb... Tue Dec 13 17:25:19 2022 Finished creating Manhattan plot successfully! Tue Dec 13 17:25:19 2022 -Skip annotating
(<Figure size 3000x2000 with 4 Axes>, <gwaslab.Log.Log at 0x7fa69cfbd580>)
or you can provide your own vcf file. Let's annotate the lead variant this time.
mysumstats.plot_mqq(mode="r",region=(7,156538803,157538803),region_grid=True,anno=True,
vcf_path="/Users/he/work/gwaslab/src/gwaslab/data/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz")
Tue Dec 13 16:37:36 2022 Start to plot manhattan/qq plot with the following basic settings: Tue Dec 13 16:37:36 2022 -Genome-wide significance level is set to 5e-08 ... Tue Dec 13 16:37:36 2022 -Raw input contains 12557761 variants... Tue Dec 13 16:37:36 2022 -Plot layout mode is : r Tue Dec 13 16:37:36 2022 -Region to plot : chr7:156538803-157538803. Tue Dec 13 16:37:42 2022 -Extract SNPs in region : chr7:156538803-157538803... Tue Dec 13 16:38:27 2022 -Extract SNPs in specified regions: 5831 Tue Dec 13 16:38:29 2022 Finished loading specified columns from the sumstats. Tue Dec 13 16:38:29 2022 Start conversion and sanity check: Tue Dec 13 16:38:29 2022 -Removed 0 variants with nan in CHR or POS column ... Tue Dec 13 16:38:29 2022 -Removed 0 variants with nan in P column ... Tue Dec 13 16:38:29 2022 -P values are being converted to -log10(P)... Tue Dec 13 16:38:29 2022 -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed... Tue Dec 13 16:38:29 2022 -Sanity check: 0 na/inf/-inf variants will be removed... Tue Dec 13 16:38:29 2022 -Maximum -log10(P) values is 7.948076083953893 . Tue Dec 13 16:38:29 2022 Finished data conversion and sanity check. Tue Dec 13 16:38:29 2022 Start to load reference genotype... Tue Dec 13 16:38:29 2022 -reference vcf path : /Users/he/work/gwaslab/src/gwaslab/data/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz Tue Dec 13 16:38:31 2022 -Retrieving index... Tue Dec 13 16:38:31 2022 -Ref variants in the region: 35278 Tue Dec 13 16:38:32 2022 -Calculating Rsq... Tue Dec 13 16:38:32 2022 Finished loading reference genotype successfully! Tue Dec 13 16:38:32 2022 Start to create manhattan plot with 5831 variants: Tue Dec 13 16:38:32 2022 Start to download recombination_hg19 ... Tue Dec 13 16:38:32 2022 -Downloading to: /Users/he/opt/anaconda3/lib/python3.8/site-packages/gwaslab/data/recombination/hg19/recombination_hg19.tar.gz Tue Dec 13 16:38:41 2022 -Updating record in config file... Tue Dec 13 16:38:41 2022 Downloaded recombination_hg19 successfully! Tue Dec 13 16:38:41 2022 -Loading gtf files from:default Tue Dec 13 16:38:41 2022 No records in config file. Please download first. Tue Dec 13 16:38:41 2022 Start to download ensembl_hg19_gtf ... Tue Dec 13 16:38:41 2022 -Downloading to: /Users/he/opt/anaconda3/lib/python3.8/site-packages/gwaslab/data/Homo_sapiens.GRCh37.87.chr.gtf.gz Tue Dec 13 16:40:46 2022 -Updating record in config file... Tue Dec 13 16:40:46 2022 Downloaded ensembl_hg19_gtf successfully! Tue Dec 13 16:41:13 2022 -plotting gene track.. Tue Dec 13 16:41:14 2022 -Finished plotting gene track.. Tue Dec 13 16:41:14 2022 -Found 1 significant variants with a sliding window size of 500 kb... Tue Dec 13 16:41:14 2022 Finished creating Manhattan plot successfully! Tue Dec 13 16:41:14 2022 -Annotating using column CHR:POS...
(<Figure size 3000x2000 with 4 Axes>, <gwaslab.Log.Log at 0x7fe1dbeb2580>)
Note: gwaslab default genome build version is build="19" (GRCh37/hg19), you can change it to build="38" (GRCh38/hg38) when needed. For gene tracks, default is gtf_path="ensembl" , you can also use gtf_path="refseq" (NCBA RefSeq)
Sampling¶
There are more than 10 million variants in the original sumstats and it will take long to process the entrie dataset. Since it might take a while to process the entire datasets, let us just random sample 1 million variants for this tutorial.
mysumstats.random_variants(n=100000,inplace=True)
Infer genome build¶
In case you don't know the genome build of the sumstats
For details, see: https://cloufield.github.io/gwaslab/InferBuild/
mysumstats.infer_build()
Fix_id¶
You may notice that the SNPID is in CHR:POS_REF_ALT format. We want SNPID to be in a stadardized format chr:pos:ref:alt, we can use fix_id for this:
For other options of standardization, see: https://cloufield.github.io/gwaslab/Standardization/
#fixsep : fix ID separator
mysumstats.fix_id(fixsep=True)
mysumstats.data
Harmonise¶
gwaslab can harmonize the sumstats based on reference files.
- ref_seq : reference genome fasta file for allele alignment
- ref_rsid_tsv : tsv file for annotation of common used variants
- ref_rsid_vcf : vcf file for annotation of other variants
- ref_infer : vcf file with allele frequency information for inferring strand and comparing allele frequency
- ref_alt_freq : field in INFO of vcf file for alternative allele frequency
For details see: https://cloufield.github.io/gwaslab/Harmonization/
For reference data, see: https://cloufield.github.io/gwaslab/Reference/
mysumstats.harmonize(basic_check=False,
n_cores=3,
ref_seq="/Users/he/Documents/Mydata/human_g1k_v37.fasta",
ref_rsid_tsv="/Users/he/Documents/Mydata/EAS_1kg_af_dbsnp151.ALL.tsv",
ref_rsid_vcf="/Users/he/Documents/Mydata/All_20180423.vcf.gz",
ref_infer="/Users/he/Documents/Mydata/eas_1kg_af/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz",ref_alt_freq="AF")
Wed Nov 2 13:28:01 2022 Start to check if NEA is aligned with reference sequence... Wed Nov 2 13:28:01 2022 -Current Dataframe shape : 100000 x 12 Wed Nov 2 13:28:01 2022 -Reference genome fasta file: /Users/he/Documents/Mydata/human_g1k_v37.fasta Wed Nov 2 13:28:01 2022 -Checking records: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y MT Wed Nov 2 13:28:39 2022 -Variants allele on given reference sequence : 41568 Wed Nov 2 13:28:39 2022 -Variants flipped : 50063 Wed Nov 2 13:28:39 2022 -Raw Matching rate : 91.63% Wed Nov 2 13:28:39 2022 -Variants inferred reverse_complement : 0 Wed Nov 2 13:28:39 2022 -Variants inferred reverse_complement_flipped : 0 Wed Nov 2 13:28:39 2022 -Both allele on genome + unable to distinguish : 8369 Wed Nov 2 13:28:39 2022 -Variants not on given reference sequence : 0 Wed Nov 2 13:28:39 2022 -Current Dataframe shape : 100000 x 12 Wed Nov 2 13:28:40 2022 Start to flip allele-specific stats for SNPs with status xxxxx[35]x: alt->ea , ref->nea ... Wed Nov 2 13:28:40 2022 -Flipping 50063 variants... Wed Nov 2 13:28:40 2022 -Swapping column: NEA <=> EA... Wed Nov 2 13:28:40 2022 -Flipping column: BETA = - BETA... Wed Nov 2 13:28:40 2022 -Flipping column: EAF = 1 - EAF... Wed Nov 2 13:28:40 2022 -Flipping column: DIRECTION +-? <=> -+? ... Wed Nov 2 13:28:40 2022 -Changed the status for flipped variants : xxxxx[35]x -> xxxxx[12]x Wed Nov 2 13:28:41 2022 Finished converting successfully! Wed Nov 2 13:28:41 2022 Start to infer strand for palindromic SNPs... Wed Nov 2 13:28:41 2022 -Current Dataframe shape : 100000 x 12 Wed Nov 2 13:28:41 2022 -Reference vcf file: /Users/he/Documents/Mydata/eas_1kg_af/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz Wed Nov 2 13:28:41 2022 -Alternative allele frequency in INFO: AF Wed Nov 2 13:28:42 2022 -Identified 14159 palindromic SNPs... Wed Nov 2 13:28:42 2022 -After filtering by MAF< 0.4 , the strand of 12908 palindromic SNPs will be inferred... Wed Nov 2 13:29:07 2022 -Non-palindromic : 76762 Wed Nov 2 13:29:07 2022 -Palindromic SNPs on + strand: 12520 Wed Nov 2 13:29:08 2022 -Palindromic SNPs on - strand and need to be flipped: 24 Wed Nov 2 13:29:08 2022 -Palindromic SNPs with maf not availble to infer : 1251 Wed Nov 2 13:29:08 2022 -Palindromic SNPs with no macthes or no information : 175 Wed Nov 2 13:29:09 2022 -Identified 8369 indistinguishable Indels... Wed Nov 2 13:29:09 2022 -Indistinguishable Indels will be inferred from reference vcf ref and alt... Wed Nov 2 13:29:36 2022 -Indels ea/nea match reference : 3538 Wed Nov 2 13:29:37 2022 -Indels ea/nea need to be flipped : 4531 Wed Nov 2 13:29:37 2022 -Indels with no macthes or no information : 300 Wed Nov 2 13:29:37 2022 -Current Dataframe shape : 100000 x 12 Wed Nov 2 13:29:38 2022 Start to flip allele-specific stats for standardized indels with status xxxx[123][67][6]: alt->ea , ref->nea ... Wed Nov 2 13:29:39 2022 -Flipping 4531 variants... Wed Nov 2 13:29:39 2022 -Swapping column: NEA <=> EA... Wed Nov 2 13:29:39 2022 -Flipping column: BETA = - BETA... Wed Nov 2 13:29:39 2022 -Flipping column: EAF = 1 - EAF... Wed Nov 2 13:29:39 2022 -Flipping column: DIRECTION +-? <=> -+? ... Wed Nov 2 13:29:39 2022 -Changed the status for flipped variants xxxx[123][67]6 -> xxxx[123][67]4 Wed Nov 2 13:29:39 2022 Start to flip allele-specific stats for palindromic SNPs with status xxxxx[12]5: (-)strand <=> (+)strand ... Wed Nov 2 13:29:40 2022 -Flipping 24 variants... Wed Nov 2 13:29:40 2022 -Flipping column: BETA = - BETA... Wed Nov 2 13:29:40 2022 -Flipping column: EAF = 1 - EAF... Wed Nov 2 13:29:40 2022 -Flipping column: DIRECTION +-? <=> -+? ... Wed Nov 2 13:29:40 2022 -Changed the status for flipped variants: xxxxx[012]5: -> xxxxx[012]2 Wed Nov 2 13:29:40 2022 Finished converting successfully! Wed Nov 2 13:29:40 2022 Start to annotate rsID based on chromosome and position information... Wed Nov 2 13:29:40 2022 -Current Dataframe shape : 100000 x 12 Wed Nov 2 13:29:40 2022 -SNPID-rsID text file: /Users/he/Documents/Mydata/EAS_1kg_af_dbsnp151.ALL.tsv Wed Nov 2 13:29:40 2022 -100000 rsID could be possibly fixed... Wed Nov 2 13:29:40 2022 -Setting block size: 5000000 Wed Nov 2 13:29:40 2022 -Loading block: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 Wed Nov 2 13:35:24 2022 -rsID Annotation for 3296 need to be fixed! Wed Nov 2 13:35:24 2022 -Annotated 96704 rsID successfully! Wed Nov 2 13:35:24 2022 Start to assign rsID using vcf... Wed Nov 2 13:35:24 2022 -Current Dataframe shape : 100000 x 13 Wed Nov 2 13:35:24 2022 -CPU Cores to use : 3 Wed Nov 2 13:35:24 2022 -Reference VCF file: /Users/he/Documents/Mydata/All_20180423.vcf.gz Wed Nov 2 13:35:24 2022 -Assigning rsID based on chr:pos and ref:alt/alt:ref... Wed Nov 2 13:35:41 2022 -rsID Annotation for 618 need to be fixed! Wed Nov 2 13:35:41 2022 -Annotated 2678 rsID successfully! Wed Nov 2 13:35:41 2022 Start to sort the genome coordinates... Wed Nov 2 13:35:41 2022 -Current Dataframe shape : 100000 x 13 Wed Nov 2 13:35:41 2022 -Force converting POS to integers... Wed Nov 2 13:35:41 2022 -Sorting genome coordinates... Wed Nov 2 13:35:42 2022 Finished sorting genome coordinates successfully! Wed Nov 2 13:35:42 2022 Start to reorder the columns... Wed Nov 2 13:35:42 2022 -Current Dataframe shape : 100000 x 13 Wed Nov 2 13:35:42 2022 -Reordering columns to : SNPID,rsID,CHR,POS,EA,NEA,EAF,BETA,SE,P,N,DIRECTION,STATUS Wed Nov 2 13:35:42 2022 Finished sorting columns successfully!
<gwaslab.Sumstats.Sumstats at 0x7fee8400ff10>
Check the data again. Looks good!
mysumstats.data
| SNPID | rsID | CHR | POS | EA | NEA | EAF | BETA | SE | P | N | DIRECTION | STATUS | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1:854168:C:T | rs79188446 | 1 | 854168 | T | C | 0.1484 | -0.0292 | 0.0156 | 0.06173 | 166718 | -?-+ | 1960010 |
| 1 | 1:856510:C:T | rs191530247 | 1 | 856510 | T | C | 0.0045 | 0.0604 | 0.1278 | 0.63650 | 166718 | +?+- | 1960010 |
| 2 | 1:981131:A:G | rs9697293 | 1 | 981131 | G | A | 0.0330 | 0.0183 | 0.0273 | 0.50200 | 166718 | -?+- | 1960000 |
| 3 | 1:1021408:G:A | rs11260587 | 1 | 1021408 | A | G | 0.0010 | -0.1388 | 0.2623 | 0.59680 | 166718 | -?+- | 1960010 |
| 4 | 1:1038997:ACTC:A | rs71576598 | 1 | 1038997 | A | ACTC | 0.2037 | 0.0162 | 0.0111 | 0.14410 | 191764 | +++- | 1960364 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 99995 | X:154429214:A:G | rs4893075 | 23 | 154429214 | G | A | 0.1802 | 0.0063 | 0.0096 | 0.51160 | 191764 | +-+- | 1960000 |
| 99996 | X:154452019:G:A | rs113007852 | 23 | 154452019 | A | G | 0.0143 | 0.0374 | 0.0437 | 0.39210 | 191764 | +-++ | 1960010 |
| 99997 | X:154682147:C:A | rs5940511 | 23 | 154682147 | A | C | 0.7852 | -0.0081 | 0.0084 | 0.33460 | 191764 | -+-+ | 1960010 |
| 99998 | X:154740552:T:C | rs5940461 | 23 | 154740552 | C | T | 0.7972 | -0.0086 | 0.0090 | 0.33840 | 191764 | -+-+ | 1960000 |
| 99999 | X:154803312:C:T | rs781981622 | 23 | 154803312 | T | C | 0.6074 | -0.0080 | 0.0115 | 0.48580 | 191764 | -+-+ | 1960010 |
100000 rows × 13 columns
Check the summary of the currrent sumstats (see: https://cloufield.github.io/gwaslab/StatusCode/):
mysumstats.summary()
| Values | Percentage | ||
|---|---|---|---|
| Category | Items | ||
| META | Row_num | 100000 | NaN |
| Column_num | 6 | NaN | |
| Column_names | SNPID,rsID,EAF,P,N,STATUS | NaN | |
| Last_checked_time | Wed Nov 2 13:35:42 2022 | NaN | |
| MISSING | Missing_total | 618 | 0.62 |
| Missing_rsID | 618 | 0.62 | |
| MAF | Common | 50567 | 50.57 |
| Low_frequency | 17297 | 17.30 | |
| Rare | 32073 | 32.07 | |
| P | Minimum | 6.973e-74 | 0.00 |
| Significant | 73 | 0.07 | |
| Suggestive | 140 | 0.14 | |
| STATUS | 1960010 | 42746 | 42.75 |
| 1960000 | 34016 | 34.02 | |
| 1960011 | 6324 | 6.32 | |
| 1960001 | 6196 | 6.20 | |
| 1960364 | 4531 | 4.53 | |
| 1960363 | 3538 | 3.54 | |
| 1960007 | 629 | 0.63 | |
| 1960017 | 622 | 0.62 | |
| 1960309 | 529 | 0.53 | |
| 1960368 | 300 | 0.30 | |
| 1960008 | 189 | 0.19 | |
| 1960319 | 181 | 0.18 | |
| 1960018 | 175 | 0.18 | |
| 1960012 | 15 | 0.02 | |
| 1960002 | 9 | 0.01 |
Formatting and saving : to_format()¶
You can easily format the processed sumstats and save it.
For details see: https://cloufield.github.io/gwaslab/Format/
mysumstats.to_format("clean_sumstats",fmt="ldsc")
Wed Nov 2 13:35:42 2022 Start to format the output sumstats in: ldsc format
Wed Nov 2 13:35:42 2022 -Formatting statistics ...
Wed Nov 2 13:35:42 2022 - Float statistics formats:
Wed Nov 2 13:35:42 2022 - Columns: ['EAF', 'BETA', 'SE', 'P']
Wed Nov 2 13:35:42 2022 - Output formats: ['{:.4g}', '{:.4f}', '{:.4f}', '{:.4e}']
Wed Nov 2 13:35:42 2022 - Start outputting sumstats in ldsc format...
Wed Nov 2 13:35:42 2022 -ldsc format will be loaded...
Wed Nov 2 13:35:42 2022 -ldsc format meta info:
Wed Nov 2 13:35:42 2022 - format_name : ldsc
Wed Nov 2 13:35:42 2022 - format_source : https://github.com/bulik/ldsc/wiki/Summary-Statistics-File-Format
Wed Nov 2 13:35:42 2022 - format_source2 : https://github.com/bulik/ldsc/blob/master/munge_sumstats.py
Wed Nov 2 13:35:42 2022 - format_version : 20150306
Wed Nov 2 13:35:42 2022 -gwaslab to ldsc format dictionary:
Wed Nov 2 13:35:42 2022 - gwaslab keys: ['rsID', 'NEA', 'EA', 'N', 'BETA', 'P', 'INFO', 'OR']
Wed Nov 2 13:35:42 2022 - ldsc values: ['SNP', 'A2', 'A1', 'N', 'Beta', 'P', 'INFO', 'OR']
Wed Nov 2 13:35:42 2022 -Output columns: Index(['SNP', 'A1', 'A2', 'Beta', 'P', 'N'], dtype='object')
Wed Nov 2 13:35:42 2022 -Output path: clean_sumstats.ldsc.tsv.gz
Wed Nov 2 13:35:44 2022 -Saving log file: clean_sumstats.ldsc.log
Wed Nov 2 13:35:44 2022 Finished outputting successfully!
Liftover¶
If the sumstats need liftover:
mysumstats.liftover(n_cores=1,from_build="19", to_build="38")
Wed Nov 2 13:49:15 2022 Start to perform liftover... Wed Nov 2 13:49:15 2022 -Current Dataframe shape : 100000 x 13 Wed Nov 2 13:49:15 2022 -CPU Cores to use : 1 Wed Nov 2 13:49:15 2022 -Performing liftover ... Wed Nov 2 13:49:15 2022 -Creating converter : hg19 to hg38 Wed Nov 2 13:49:16 2022 -Converting variants with status code xxx0xxx :100000... Wed Nov 2 13:49:50 2022 -Removed unmapped variants: 53 Wed Nov 2 13:49:52 2022 Finished liftover successfully!
mysumstats.data
| SNPID | rsID | CHR | POS | EA | NEA | EAF | BETA | SE | P | N | DIRECTION | STATUS | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1:854168:C:T | rs79188446 | 1 | 918788 | T | C | 0.1484 | -0.0292 | 0.0156 | 0.06173 | 166718 | -?-+ | 3860099 |
| 1 | 1:856510:C:T | rs191530247 | 1 | 921130 | T | C | 0.0045 | 0.0604 | 0.1278 | 0.63650 | 166718 | +?+- | 3860099 |
| 2 | 1:981131:A:G | rs9697293 | 1 | 1045751 | G | A | 0.0330 | 0.0183 | 0.0273 | 0.50200 | 166718 | -?+- | 3860099 |
| 3 | 1:1021408:G:A | rs11260587 | 1 | 1086028 | A | G | 0.0010 | -0.1388 | 0.2623 | 0.59680 | 166718 | -?+- | 3860099 |
| 4 | 1:1038997:ACTC:A | rs71576598 | 1 | 1103617 | A | ACTC | 0.2037 | 0.0162 | 0.0111 | 0.14410 | 191764 | +++- | 3860399 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 99995 | X:154429214:A:G | rs4893075 | 23 | 155200937 | G | A | 0.1802 | 0.0063 | 0.0096 | 0.51160 | 191764 | +-+- | 3860099 |
| 99996 | X:154452019:G:A | rs113007852 | 23 | 155223738 | A | G | 0.0143 | 0.0374 | 0.0437 | 0.39210 | 191764 | +-++ | 3860099 |
| 99997 | X:154682147:C:A | rs5940511 | 23 | 155452486 | A | C | 0.7852 | -0.0081 | 0.0084 | 0.33460 | 191764 | -+-+ | 3860099 |
| 99998 | X:154740552:T:C | rs5940461 | 23 | 155510891 | C | T | 0.7972 | -0.0086 | 0.0090 | 0.33840 | 191764 | -+-+ | 3860099 |
| 99999 | X:154803312:C:T | rs781981622 | 23 | 155573651 | T | C | 0.6074 | -0.0080 | 0.0115 | 0.48580 | 191764 | -+-+ | 3860099 |
99947 rows × 13 columns
Note: Gwaslab only liftover CHR and POS, and when lifted, the last two digits status code will be rolled back to 99. Since for difference reference genome, the reference allele or strand might be reverse, so it is need to align and check agin.
For details, see https://cloufield.github.io/gwaslab/LiftOver/